Gabriel Singer
23.02.2022
Multiple metric continuous independent predictors are used to predict one metric response variable.
We have already met two-way ANOVA and ANCOVA as models with more than one predictor.
Model formulation as a linear combination of \(k\) predictors: \[ Y=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_kX_k \] \(\beta_0\) is the intercept; \(\beta_1...\beta_k\) are partial regression coefficients for \(X_1...X_k\), each expressing the change in \(Y\) per unit change in a particular \(X_j\) while holding all other \(X\)-variables constant (for instance, the effect of \(X_1\) while adjusting for any effects of \(X_2...X_k\)).
\(\beta_0...\beta_k\) are estimated by \(b_0...b_k\) from the sample.
Using matrix notation:
\[ \hat{y} = \mathbf{X}\mathbf{B} \]
\(\mathbf{B}\) is the parameter vector
\(\mathbf{X}\) is the design matrix
\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1} \\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{1,n} & x_{2,n} & \dots & x_{k,n} \\ \end{bmatrix} \]
\[ \begin{aligned} \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]
The equation is a linear system and can be solved with linear algebra by OLS, minimizing the sum of squared residuals:
\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (\mathbf{X}\mathbf{B} - y \right)^2 \]
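The minimization has the closed-form solution \(\mathbf{B}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty\) (the normal equations). A minimal sketch in R on made-up data (x1, x2 and y are hypothetical):

```r
# OLS via the normal equations on a tiny hypothetical dataset
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 5)
y  <- 1 + 2*x1 + 3*x2          # exact linear relation, no noise
X  <- cbind(1, x1, x2)         # design matrix with intercept column
B  <- solve(t(X) %*% X) %*% (t(X) %*% y)
as.vector(B)                   # recovers b0 = 1, b1 = 2, b2 = 3
```

The same coefficients come out of coef(lm(y ~ x1 + x2)); internally lm uses a numerically more stable QR decomposition rather than an explicit matrix inverse.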
The overall regression is tested by ANOVA (as in simple linear regression): \(SS_{reg}\) and \(SS_{res}\) are converted to \(MS_{reg}\) and \(MS_{res}\), then the F-ratio of explained to residual variance gives P.
Individual predictors are tested with the \(t\)-statistic (in the output of lm) or by forming CIs for each partial regression coefficient and checking whether they overlap 0. Alternatively: compare the fit of nested models, a full one and a reduced one (without the predictor in question), and check for a significant change of \(SS_{reg}\) using a partial F-test.
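A sketch of the nested-model comparison in R on simulated data (variable names are made up):

```r
# Partial F-test: does adding x2 significantly reduce SS_res?
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5*x1 + rnorm(n)    # x2 has no true effect here
reduced <- lm(y ~ x1)
full    <- lm(y ~ x1 + x2)
anova(reduced, full)           # F-test on the 1-df change in SS
```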
Explained variance: multiple \(r^2\) (analogous to simple regression) = "the percentage of variation of \(Y\) explained by all \(X\) variables".
BUT: more predictors (even if random) always increase \(r^2\); therefore use adjusted \(r^2\), which accounts for the number of predictors.
\[ {r^2}_{adj}=1-\frac{SS_{res}}{SS_{tot}}\cdot\frac{n-1}{n-k-1} \] The idea is to balance the increment in \(r^2\) against the additional complexity of the model: the cost of adding a predictor must be offset by an actual gain in \(r^2\).
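Adjusted \(r^2\) can be checked by hand against R's own output (here \(k\) counts the predictors, so the residual df are \(n-k-1\)); simulated data:

```r
# r2 and adjusted r2 computed by hand for a model with k = 2 predictors
set.seed(42)
n  <- 25; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
mod    <- lm(y ~ x1 + x2)
SS_res <- sum(residuals(mod)^2)
SS_tot <- sum((y - mean(y))^2)
r2     <- 1 - SS_res/SS_tot
r2_adj <- 1 - SS_res/SS_tot * (n - 1)/(n - k - 1)
c(r2, r2_adj)   # match summary(mod)$r.squared and $adj.r.squared
```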
Comparing the influence of predictors measured on different scales requires standardized partial regression coefficients: \[ {b_j}^*=b_j\frac{s_{X_j}}{s_Y} \] A higher \(|{b_j}^*|\) means a stronger influence of \(X_j\). Note that in software output \({b_j}^*\) is often referred to as \(\beta_j\).
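A sketch on simulated data; the standardized coefficients equal the slopes obtained when refitting the model on z-transformed variables:

```r
# Standardized partial regression coefficients: b*_j = b_j * s_Xj / s_Y
set.seed(7)
n  <- 40
x1 <- rnorm(n, sd = 10)        # large-scale predictor
x2 <- rnorm(n, sd = 0.1)       # small-scale predictor
y  <- 0.2*x1 + 5*x2 + rnorm(n)
mod    <- lm(y ~ x1 + x2)
b_star <- coef(mod)[-1] * c(sd(x1), sd(x2)) / sd(y)
# same values as slopes of the model fitted to standardized variables:
coef(lm(scale(y) ~ scale(x1) + scale(x2)))[-1]
```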
Classical tools to search for important predictors to include: forward and backward stepwise modelling, with addition or dropping of predictors based on F- or P-values (in R: add1, drop1 or step).
Model building: the extensive search to identify important predictors leads away from strict hypothesis testing. Prediction quality counts more than significance testing, so even insignificant predictors may be worthwhile to keep and extensive model building may happen.
Overfitting: in an overfitted model a predictor contributes to an increase of \(r^2\) merely through random noise in the specific dataset used to build the model. The inclusion of such irrelevant predictors inflates apparent fit without improving predictions for new data.
Ways out:
Use \({r^2}_{adj}\) when judging model fit and comparing models: Large differences between \(r^2\) and \({r^2}_{adj}\) point to an overfitted model.
Use information criteria to compare models: the Akaike information criterion (AIC) aims to identify the most parsimonious model, i.e. the model achieving the best fit with the lowest complexity (= number of predictors). Similar to \({r^2}_{adj}\), it is computed by "punishing" a measure of model fit with a term involving the number of predictors \(k\):
\[ AIC=n\cdot{\ln{SS_{res}}}+2(k+1)-n\cdot{\ln{n}} \] We aim for models with low AIC. Models within \(\Delta{AIC}<2\) of the best model are considered equivalent.
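This is, up to notation, the quantity R reports via extractAIC for linear models; a sketch on simulated data:

```r
# AIC formula from above, checked against extractAIC()
set.seed(3)
n  <- 20; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
mod    <- lm(y ~ x1 + x2)
SS_res <- sum(residuals(mod)^2)
aic    <- n*log(SS_res) + 2*(k + 1) - n*log(n)
aic - extractAIC(mod)[2]       # zero: the two expressions agree
```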
Validation techniques:
Assume predictors \(X_1\) and \(X_2\) and limited sampling effort of n=16:
Check assumptions with qqnorm(residuals(mod)), qqline(residuals(mod)) and plot(mod).
Run cor on the predictor matrix and check for large correlations.
Highly correlated predictor variables cause model instability, large CIs for regression slopes and unclear importance of individual predictors (but the overall model could still be good).
Assessing collinearity by VIF: Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:
\[ x_i = b_0 + \sum_{j \neq i} b_j x_j + \epsilon \]
Large \(R^2_i\) would argue for redundancy of \(x_i\) (its information is already contained in a linear combination of all other \(x\)-variables). \(VIF_i\) is a transformation of \(R^2_i\): \[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \]
The VIF (hence the name, variance inflation factor) tells you by how much the SE of a regression coefficient is inflated by correlation among predictors: \(s_{b,\rho=0}\) is the SE the coefficient would have if its predictor were uncorrelated (\(\rho=0\)) with all other \(x\)-variables, and the actual SE is larger by a factor \(\sqrt{\mathrm{VIF}}\):
\[ s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}} \]
Get VIF with car::vif(full_model) for all predictors of a particular model.
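VIF can also be computed by hand from the auxiliary regression, which makes the definition concrete; a sketch with deliberately collinear simulated predictors:

```r
# VIF for x1 computed from the auxiliary regression x1 ~ other predictors
set.seed(11)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)  # strongly correlated with x1
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)
r2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_1 <- 1/(1 - r2_1)          # clearly > 1 due to the x1-x2 correlation
# car::vif(lm(y ~ x1 + x2 + x3))["x1"] returns the same value
```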
Geometrical representation / graphing:
# data from limnology: SPM in lakes
# required: MLR to predict SPM from other variables
data<-read.table(file="data/LakeSPM.txt",header=TRUE)[,-c(1:2)]
# remove first two variables (just lakename and # in dataset)
plot(data) # correlations? collinearity?
data[,-c(5,10)]<-log(data[,-c(5,10)]) # transform all except pH and Vd
# dropping colinear variables using VIF
mlr.full<-lm(spm~.,data=data) # makes a model with all predictors
library(car)
## Loading required package: carData
vif(mlr.full)
## area Dm Dmax pH TP T
## 1.259843e+06 9.310628e+05 3.947461e+02 2.337318e+00 3.883131e+00 3.576987e+02
## Q DR Vd
## 5.275149e+02 1.034419e+06 4.269601e+01
# check high VIF for area
summary(lm(area~.-spm,data=data)) # -> r2 is 1!
##
## Call:
## lm(formula = area ~ . - spm, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0053403 -0.0008962 0.0003460 0.0008821 0.0040917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.894e-04 5.712e-02 -0.010 0.992
## Dm 2.005e+00 1.526e-02 131.403 <2e-16 ***
## Dmax -2.862e-03 9.258e-03 -0.309 0.761
## pH 1.243e-04 1.002e-03 0.124 0.903
## TP 7.409e-05 1.685e-03 0.044 0.965
## T -2.094e-04 3.995e-03 -0.052 0.959
## Q -1.008e-04 3.802e-03 -0.027 0.979
## DR 1.999e+00 7.636e-03 261.836 <2e-16 ***
## Vd 9.416e-04 8.395e-03 0.112 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002301 on 17 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.677e+06 on 8 and 17 DF, p-value: < 2.2e-16
# continue dropping variables until VIF <5-10
del<-which(names(data) %in% c("area","Dm","Dmax"))
data<-data[,-del]
mlr.full<-lm(spm~.,data=data) # refit after variable deletion
# start with simple regression model
simple.mod1<-lm(spm~TP,data=data)
summary(simple.mod1)
##
## Call:
## lm(formula = spm ~ TP, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90449 -0.28309 -0.08777 0.17917 1.20813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8724 0.5793 -6.685 6.48e-07 ***
## TP 1.5517 0.1891 8.206 2.00e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5089 on 24 degrees of freedom
## Multiple R-squared: 0.7373, Adjusted R-squared: 0.7263
## F-statistic: 67.35 on 1 and 24 DF, p-value: 2.004e-08
plot(spm~TP,data=data)
abline(simple.mod1,col="red")
# forward stepwise MLR building using F
# start with TP and see if other terms can be added to the model
names(data)
## [1] "spm" "pH" "TP" "T" "Q" "DR" "Vd"
add1(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,test="F")
## Single term additions
##
## Model:
## spm ~ TP
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 6.2163 -33.204
## pH 1 2.02398 4.1923 -41.446 11.1041 0.002896 **
## T 1 0.41143 5.8048 -32.985 1.6302 0.214421
## Q 1 0.00925 6.2070 -31.243 0.0343 0.854777
## DR 1 1.96782 4.2484 -41.100 10.6533 0.003414 **
## Vd 1 0.79150 5.4248 -34.745 3.3558 0.079950 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# adds single terms and compares models with an F-test
mlr1<-lm(spm~TP+pH,data=data)
summary(mlr1)
##
## Call:
## lm(formula = spm ~ TP + pH, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86284 -0.24187 -0.04459 0.18318 0.91888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.5089 0.9285 -7.010 3.83e-07 ***
## TP 1.4664 0.1607 9.127 4.16e-09 ***
## pH 0.4105 0.1232 3.332 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4269 on 23 degrees of freedom
## Multiple R-squared: 0.8228, Adjusted R-squared: 0.8074
## F-statistic: 53.4 on 2 and 23 DF, p-value: 2.275e-09
# -> good model, try adding more terms
# backward stepwise MLR building using F
# start with full model (all terms) and see if terms can be dropped
summary(mlr.full)
##
## Call:
## lm(formula = spm ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92577 -0.11178 0.02268 0.20035 0.76549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.60757 1.29121 -2.794 0.01158 *
## pH 0.25790 0.16681 1.546 0.13860
## TP 1.02714 0.26827 3.829 0.00113 **
## T -0.04065 0.05496 -0.740 0.46863
## Q -0.02767 0.04319 -0.641 0.52947
## DR 0.36796 0.12515 2.940 0.00840 **
## Vd 0.31319 0.32422 0.966 0.34619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3871 on 19 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8417
## F-statistic: 23.15 on 6 and 19 DF, p-value: 8.788e-08
drop1(mlr.full,test="F")
## Single term deletions
##
## Model:
## spm ~ pH + TP + T + Q + DR + Vd
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.8466 -43.511
## pH 1 0.35810 3.2047 -42.430 2.3901 0.138597
## TP 1 2.19631 5.0429 -30.643 14.6594 0.001133 **
## T 1 0.08194 2.9286 -44.773 0.5469 0.468633
## Q 1 0.06147 2.9081 -44.955 0.4103 0.529466
## DR 1 1.29509 4.1417 -35.762 8.6441 0.008404 **
## Vd 1 0.13980 2.9864 -44.264 0.9331 0.346193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# -> drop Q (lowest F, highest P, also makes lowest AIC)
mlr2<-lm(spm~pH+TP+T+DR+Vd,data=data)
summary(mlr2)
##
## Call:
## lm(formula = spm ~ pH + TP + T + DR + Vd, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88293 -0.20247 0.00856 0.23363 0.81997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.80213 1.23633 -3.075 0.00597 **
## pH 0.21254 0.14880 1.428 0.16861
## TP 0.98497 0.25620 3.844 0.00101 **
## T -0.03584 0.05364 -0.668 0.51164
## DR 0.35463 0.12158 2.917 0.00853 **
## Vd 0.42131 0.27270 1.545 0.13804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3813 on 20 degrees of freedom
## Multiple R-squared: 0.8771, Adjusted R-squared: 0.8464
## F-statistic: 28.54 on 5 and 20 DF, p-value: 1.869e-08
# continue until no more dropping...
# automatic stepwise search based on AIC
step(mlr.full,direction="backward")
## Start: AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
##
## Df Sum of Sq RSS AIC
## - Q 1 0.06147 2.9081 -44.955
## - T 1 0.08194 2.9286 -44.773
## - Vd 1 0.13980 2.9864 -44.264
## <none> 2.8466 -43.511
## - pH 1 0.35810 3.2047 -42.430
## - DR 1 1.29509 4.1417 -35.762
## - TP 1 2.19631 5.0429 -30.643
##
## Step: AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
##
## Df Sum of Sq RSS AIC
## - T 1 0.06492 2.9730 -46.381
## <none> 2.9081 -44.955
## - pH 1 0.29667 3.2048 -44.430
## - Vd 1 0.34706 3.2552 -44.024
## - DR 1 1.23717 4.1453 -37.739
## - TP 1 2.14909 5.0572 -32.569
##
## Step: AIC=-46.38
## spm ~ pH + TP + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## - pH 1 0.2469 3.2200 -46.307
## - Vd 1 0.2913 3.2643 -45.951
## - DR 1 1.1823 4.1554 -39.676
## - TP 1 4.5877 7.5608 -24.113
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = data)
##
## Coefficients:
## (Intercept) pH TP DR Vd
## -3.8703 0.1878 1.0959 0.3433 0.3709
step(simple.mod1,scope=~pH+TP+T+Q+DR+Vd,direction="forward")
## Start: AIC=-33.2
## spm ~ TP
##
## Df Sum of Sq RSS AIC
## + pH 1 2.02398 4.1923 -41.446
## + DR 1 1.96782 4.2484 -41.100
## + Vd 1 0.79150 5.4248 -34.745
## <none> 6.2163 -33.204
## + T 1 0.41143 5.8048 -32.985
## + Q 1 0.00925 6.2070 -31.243
##
## Step: AIC=-41.45
## spm ~ TP + pH
##
## Df Sum of Sq RSS AIC
## + DR 1 0.92796 3.2643 -45.951
## <none> 4.1923 -41.446
## + Vd 1 0.03692 4.1554 -39.676
## + Q 1 0.02363 4.1686 -39.593
## + T 1 0.00252 4.1898 -39.462
##
## Step: AIC=-45.95
## spm ~ TP + pH + DR
##
## Df Sum of Sq RSS AIC
## + Vd 1 0.291287 2.9730 -46.381
## <none> 3.2643 -45.951
## + Q 1 0.225708 3.0386 -45.814
## + T 1 0.009151 3.2552 -44.024
##
## Step: AIC=-46.38
## spm ~ TP + pH + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## + T 1 0.064924 2.9081 -44.955
## + Q 1 0.044461 2.9286 -44.773
##
## Call:
## lm(formula = spm ~ TP + pH + DR + Vd, data = data)
##
## Coefficients:
## (Intercept) TP pH DR Vd
## -3.8703 1.0959 0.1878 0.3433 0.3709
# same solutions! compare with selection by F!
# combined forward and backward selection based on AIC
step(mlr.full,direction="both")
## Start: AIC=-43.51
## spm ~ pH + TP + T + Q + DR + Vd
##
## Df Sum of Sq RSS AIC
## - Q 1 0.06147 2.9081 -44.955
## - T 1 0.08194 2.9286 -44.773
## - Vd 1 0.13980 2.9864 -44.264
## <none> 2.8466 -43.511
## - pH 1 0.35810 3.2047 -42.430
## - DR 1 1.29509 4.1417 -35.762
## - TP 1 2.19631 5.0429 -30.643
##
## Step: AIC=-44.96
## spm ~ pH + TP + T + DR + Vd
##
## Df Sum of Sq RSS AIC
## - T 1 0.06492 2.9730 -46.381
## <none> 2.9081 -44.955
## - pH 1 0.29667 3.2048 -44.430
## - Vd 1 0.34706 3.2552 -44.024
## + Q 1 0.06147 2.8466 -43.511
## - DR 1 1.23717 4.1453 -37.739
## - TP 1 2.14909 5.0572 -32.569
##
## Step: AIC=-46.38
## spm ~ pH + TP + DR + Vd
##
## Df Sum of Sq RSS AIC
## <none> 2.9730 -46.381
## - pH 1 0.2469 3.2200 -46.307
## - Vd 1 0.2913 3.2643 -45.951
## + T 1 0.0649 2.9081 -44.955
## + Q 1 0.0445 2.9286 -44.773
## - DR 1 1.1823 4.1554 -39.676
## - TP 1 4.5877 7.5608 -24.113
##
## Call:
## lm(formula = spm ~ pH + TP + DR + Vd, data = data)
##
## Coefficients:
## (Intercept) pH TP DR Vd
## -3.8703 0.1878 1.0959 0.3433 0.3709
mlr3<-lm(spm~pH+TP+DR+Vd,data=data)
AIC(mlr3)
## [1] 29.40346
extractAIC(mlr3)
## [1] 5.00000 -46.38135
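Note that the two values differ: AIC computes the criterion from the full Gaussian log-likelihood (including constant terms and counting the residual variance as a parameter), while extractAIC uses the formula given earlier. For a fixed dataset they differ only by an additive constant, so model comparisons are unaffected; a sketch on simulated data:

```r
# AIC() and extractAIC() differ by the same constant for lm fits
set.seed(5)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
AIC(m1) - extractAIC(m1)[2]    # same offset for both models...
AIC(m2) - extractAIC(m2)[2]    # ...so model rankings agree
```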